ABSTRACT

It is shown that n(N), the average number of nodes in an
N-key random 3-2 tree, satisfies the inequality 0.70N - ε < n(N) < 0.79N + ε.
A similar analysis is done for general B-trees. It is shown that
storage utilization is essentially ln 2 ≈ 69% for B-trees of high
orders.


1.  Introduction 

Balanced  tree  structures  are  often  used  in  the  organization  of 
information. One attractive scheme, called "3-2 trees," was introduced
by J. Hopcroft [1] [4]. Some interesting questions concerning 3-2 trees
have been raised [3] [4]. In this paper we present a partial solution to
a problem posed in [3].

A 3-2 tree is a tree in which every internal node contains either
1 or 2 keys, and all the leaves are at the same level (see Figure 1). In
drawing 3-2 trees we shall adopt the notation used in [3]. Thus keys are
represented by dots inside a node, as we shall only be interested in the
structure of the trees.

To  put  a  new  key  into  a  node  that  contains  only  one  key,  we 
simply  insert  it  as  the  second  key.   If  the  node  already  contains  2  keys, 
we  split  the  node  into  two  nodes  containing  respectively  the  minimum  and 
the  maximum  of  the  three  keys,  and  insert  the  middle  key  into  the  parent 
node  by  repeating  the  process.  When  there  is  no  node  above,  a  new  root 
node  will  be  created  to  hold  the  middle  key. 

Consider a 3-2 tree T with j-1 keys in it. These j-1 keys
divide all possible key values into j intervals. The insertion of a new
key K_j into T is said to be a random insertion if K_j is equally likely
to fall in any one of the j intervals defined above.
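The insertion rule and the notion of a random insertion can be sketched in a few lines of Python. The sketch is only an illustration and is not part of the analysis. Since only the shape of the tree matters, it stores no keys: a node is a list of its children and a leaf is `None`. The names `insert`, `random_tree`, `leaves`, and `count_nodes` are hypothetical.

```python
import random

LEAF = None  # a leaf, i.e. an interval between adjacent keys

def leaves(node):
    """Number of leaves below node; a tree with j-1 keys has j leaves."""
    return 1 if node is LEAF else sum(leaves(c) for c in node)

def count_nodes(node):
    """Number of internal nodes, n(T)."""
    return 0 if node is LEAF else 1 + sum(count_nodes(c) for c in node)

def insert(node, k):
    """Insert a key into the k-th leaf interval (0-based) below node.

    Returns a list holding the one or two nodes that replace node."""
    if node[0] is LEAF:                  # lowest-level internal node
        children = node + [LEAF]         # one more key means one more leaf
    else:
        children = []
        for child in node:
            n = leaves(child)
            if 0 <= k < n:               # the chosen interval lies below child
                children.extend(insert(child, k))
            else:
                children.append(child)
            k -= n
    if len(children) <= 3:               # at most 2 keys: no split needed
        return [children]
    return [children[:2], children[2:]]  # 3 keys: split, middle key moves up

def random_tree(n_keys, rng):
    """Build a tree by n_keys random insertions (n_keys >= 1)."""
    root = [LEAF, LEAF]                  # the tree holding a single key
    for j in range(2, n_keys + 1):
        parts = insert(root, rng.randrange(j))         # j equally likely intervals
        root = parts[0] if len(parts) == 1 else parts  # two parts form a new root
    return root

tree = random_tree(11, random.Random(0))
print(leaves(tree), count_nodes(tree))
```

Averaging `count_nodes` over many trees built this way gives an empirical estimate of n(N), the quantity studied in the rest of the paper.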

Now  consider  the  building  of  3-2  trees  by  successive  random 
insertions.  The  average  cost  involved  is  dependent  on  the  specific 
implementation  of  the  insertion  algorithm.  There  are,  however,  certain 
quantities  that  are  useful  in  general  for  the  analysis.   One  quantity  of 
interest  is  n(N),  the  average  number  of  internal  nodes  in  a  3-2  tree  after 
N keys have been randomly inserted into the empty tree. In this paper we
shall derive bounds on n(N) and on the corresponding quantity for
B-trees (see Section 3). A systematic procedure for deriving improved
bounds is discussed, but the computation involved appears to be
prohibitive. The main results are contained in Theorems 2.6, 2.13, and 3.1.


Figure 1   A 3-2 tree with 11 keys


Figure 2   The two types of 3-2 trees of height 1

2.  Number  of  Nodes  in  3-2  Trees 

Let T be any 3-2 tree. We shall use n(T) to denote the number
of internal nodes of T. Let f_N(T) be the probability that T will result
after N random insertions are made to the empty tree. Obviously f_N(T) is
zero unless T contains exactly N keys. In terms of f_N(T) and n(T), the
average number of nodes n(N) defined in Section 1 can be expressed as
follows:

    n(N) = \sum_T n(T) f_N(T)                                      (1)

To  derive  bounds  on  n(N),  we  observe  that  most  of  the  internal 
nodes  of  any  3-2  tree  appear  on  the  lowest  few  levels.   Therefore,  a 
good  estimate  of  n(N)  can  be  obtained  by  analyzing  the  number  of  internal 
nodes  in  those  levels  of  a  random  3-2  tree.  We  shall  carry  out  the 
analysis  for  the  lowest  level  first  in  the  next  subsection,  and  then 
take  the  second  lowest  level  into  account  in  Section  2.2. 

2.1  First  Order  Analysis 

As shown in Figure 2, there are two types of 3-2 trees of
height 1. The type 1 3-2 tree contains 1 key and the type 2 3-2 tree
contains 2 keys. An arbitrary 3-2 tree T is said to be of class (1;x_1,x_2)
if, among the lowest height 1 subtrees of T, x_1 of them are of type 1 and
x_2 are of type 2. The 3-2 tree shown in Figure 1 is of class (1;3,2). Let
T be an N-key 3-2 tree of class (1;x_1,x_2). The following lemmas are easy
to obtain:


Lemma 2.1: 2x_1 + 3x_2 = N+1.

Proof: Both N+1 and 2x_1 + 3x_2 are equal to the number of leaves of T. □

Lemma 2.2: \frac{3}{2}(x_1 + x_2) - \frac{1}{2} \le n(T) \le 2(x_1 + x_2) - 1.

Proof: There are x_1 + x_2 - 1 keys contained in the internal nodes above
the lowest level. Thus the number of nodes above the lowest level,
n(T) - (x_1 + x_2), satisfies

    \frac{1}{2}(x_1 + x_2 - 1) \le n(T) - (x_1 + x_2) \le x_1 + x_2 - 1.

Lemma 2.2 follows. □

Definition 2.3: Let \mathcal{T}(x_1,x_2) be the set of 3-2 trees of class (1;x_1,x_2).
Define

    P_N(x_1,x_2) = \sum_{T \in \mathcal{T}(x_1,x_2)} f_N(T).

Definition 2.4:

    A_i(N) = \sum_{x_1,x_2} x_i P_N(x_1,x_2),    i = 1,2.

P_N(x_1,x_2) is obviously the probability for a random N-key 3-2
tree to be of class (1;x_1,x_2), and A_i(N) is the average value of x_i for
random N-key 3-2 trees.

Lemma 2.5: \frac{3}{2}(A_1(N) + A_2(N)) - \frac{1}{2} \le n(N) \le 2(A_1(N) + A_2(N)) - 1.

Proof: This follows from Lemma 2.2 and the definitions of n(N), A_1(N),
A_2(N). □

Lemma 2.6: A_1(N) = \frac{2}{7}(N+1), A_2(N) = \frac{1}{7}(N+1) for N \ge 6.

Proof: Let T be an (N-1)-key 3-2 tree of class (1;x_1,x_2). By making a
random insertion to T, we will obtain a 3-2 tree either of class
(1;x_1-1,x_2+1) or of class (1;x_1+2,x_2-1). The former situation happens,
with probability 2x_1/N, when the new key is inserted into a subtree of
type 1. Thus we have

    A_1(N) = \sum_{x_1,x_2} P_{N-1}(x_1,x_2) [ \frac{2x_1}{N}(x_1 - 1) + (1 - \frac{2x_1}{N})(x_1 + 2) ]

           = \sum_{x_1,x_2} P_{N-1}(x_1,x_2) ( x_1 + 2 - \frac{6x_1}{N} )

           = (1 - \frac{6}{N}) A_1(N-1) + 2.                       (2)

With the initial condition A_1(1) = 1, it is easy to show from (2) that

    A_1(N) = \frac{2}{7}(N+1)    for N \ge 6.                      (3)

Lemma 2.1 implies 2A_1(N) + 3A_2(N) = N+1. This and (3) give

    A_2(N) = \frac{1}{7}(N+1)    for N \ge 6.    □
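Recurrence (2) is easy to check by direct iteration. The following sketch uses exact rational arithmetic; the function name `average_type1` is a hypothetical choice, and the recurrence A_1(N) = (1 - 6/N) A_1(N-1) + 2 with A_1(1) = 1 is taken from the proof above.

```python
from fractions import Fraction

def average_type1(N):
    """A_1(N) computed from recurrence (2), starting from A_1(1) = 1."""
    a = Fraction(1)
    for n in range(2, N + 1):
        a = (1 - Fraction(6, n)) * a + 2
    return a

# Compare the iterated value with the closed form (2/7)(N+1).
for N in (1, 2, 3, 4, 5, 6, 7, 20):
    print(N, average_type1(N), Fraction(2, 7) * (N + 1))
```

The factor 1 - 6/N vanishes at N = 6, which erases the dependence on the initial condition; this is why the closed form is exact from N = 6 onward rather than only asymptotically.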


Lemmas 2.5 and 2.6 lead immediately to the following theorem:

Theorem 2.6: \frac{9}{14}N + \frac{1}{7} \le n(N) \le \frac{6}{7}N - \frac{1}{7} for N \ge 6.

Corollary: 0.64N \le n(N) \le 0.86N for N \ge 6.

The above bounds should be compared with the obvious bounds \frac{N}{2} \le n(N) \le N,
which can be regarded as the zero order approximation of n(N).

2.2  Second  Order  Analysis 

Better  bounds  for  n(N)  can  be  derived  by  considering  the  internal 
nodes  on  the  lowest  2  levels  of  3-2  trees.   Let  us  divide  all  3-2  trees  of 
height 2 into 9 types as shown in Figure 3. For any 3-2 tree T with no fewer
than 3 keys, we can classify T by its height 2 subtrees. We shall say
that T is of class (2;x_1,x_2,...,x_9) if there are x_i height 2 subtrees of
type i for each i (Figure 4). Let T be an N-key 3-2 tree of class
(2;x_1,x_2,...,x_9). The following two lemmas are easy to prove.


Figure 3   There are 9 types (Type 1 through Type 9) of 3-2 trees
of height 2 (leaves not shown).


Figure 4   A 3-2 tree of class (2;1,2,0,0,0,0,1,0,0)
Lemma 2.7: 4x_1 + 5x_2 + 6x_3 + 6x_4 + 7x_5 + 7x_6 + 8x_7 + 8x_8 + 9x_9 = N+1.

Proof: Similar to the proof of Lemma 2.1. □

Lemma 2.8:

    \frac{7}{2}\sum_{i=1}^{3} x_i + \frac{9}{2}\sum_{i=4}^{9} x_i - \frac{1}{2} \le n(T) \le 4\sum_{i=1}^{3} x_i + 5\sum_{i=4}^{9} x_i - 1.

Proof: Similar to the proof of Lemma 2.2. □

In analogy with the notation P_N(x_1,x_2) defined in Section 2.1,
we use P_N(2;x_1,x_2,...,x_9) to denote the probability for an N-key random
3-2 tree to be of class (2;x_1,x_2,...,x_9). For each i (1 \le i \le 9), define

    A_i(N) = \sum_{x_1,...,x_9} x_i P_N(2;x_1,...,x_9).


Lemma 2.9:

    \frac{7}{2}\sum_{i=1}^{3} A_i(N) + \frac{9}{2}\sum_{i=4}^{9} A_i(N) - \frac{1}{2} \le n(N) \le 4\sum_{i=1}^{3} A_i(N) + 5\sum_{i=4}^{9} A_i(N) - 1.

Proof: Use Lemma 2.8 and the definitions of A_i(N), n(N). □

We shall study the values of the A_i(N)'s. Once these numbers
are known, Lemma 2.9 determines n(N) to within 15%.

Consider any (N-1)-key 3-2 tree T of class (2;x_1,x_2,...,x_9).
By examining the insertion process, it can be seen that there are 17
classes of trees that T might become upon the random insertion of a key.
These 17 possible classes together with their probabilities of occurrence
are tabulated in Table 1. Recurrence relations for the A_i(N)'s can be
obtained from Table 1 as in Section 2.1. For example, it is easy to show
that

    A_1(N) = \sum_{x_1,...,x_9} P_{N-1}(2;x_1,...,x_9) [ x_1 + \frac{4x_1}{N}(-1) + \frac{3x_5}{N}(2) + \frac{3x_6}{N}(2)
                 + \frac{6x_7}{N}(1) + \frac{6x_8}{N}(1) + \frac{6x_9}{N}(1) ]

           = A_1(N-1) + \frac{1}{N}(-4A_1(N-1) + 6A_5(N-1) + 6A_6(N-1) + 6A_7(N-1) + 6A_8(N-1)
                 + 6A_9(N-1)).


    x'_1     x'_2     x'_3     x'_4     x'_5     x'_6     x'_7     x'_8     x'_9     Probability

    x_1-1    x_2+1    x_3      x_4      x_5      x_6      x_7      x_8      x_9      4x_1/N
    x_1      x_2-1    x_3+1    x_4      x_5      x_6      x_7      x_8      x_9      2x_2/N
    x_1      x_2-1    x_3      x_4+1    x_5      x_6      x_7      x_8      x_9      3x_2/N
    x_1      x_2      x_3-1    x_4      x_5+1    x_6      x_7      x_8      x_9      6x_3/N
    x_1      x_2      x_3      x_4-1    x_5+1    x_6      x_7      x_8      x_9      4x_4/N
    x_1      x_2      x_3      x_4-1    x_5      x_6+1    x_7      x_8      x_9      2x_4/N
    x_1      x_2      x_3      x_4      x_5-1    x_6      x_7+1    x_8      x_9      2x_5/N
    x_1      x_2      x_3      x_4      x_5-1    x_6      x_7      x_8+1    x_9      2x_5/N
    x_1+2    x_2      x_3      x_4      x_5-1    x_6      x_7      x_8      x_9      3x_5/N
    x_1      x_2      x_3      x_4      x_5      x_6-1    x_7+1    x_8      x_9      4x_6/N
    x_1+2    x_2      x_3      x_4      x_5      x_6-1    x_7      x_8      x_9      3x_6/N
    x_1      x_2      x_3      x_4      x_5      x_6      x_7-1    x_8      x_9+1    2x_7/N
    x_1+1    x_2+1    x_3      x_4      x_5      x_6      x_7-1    x_8      x_9      6x_7/N
    x_1      x_2      x_3      x_4      x_5      x_6      x_7      x_8-1    x_9+1    2x_8/N
    x_1+1    x_2+1    x_3      x_4      x_5      x_6      x_7      x_8-1    x_9      6x_8/N
    x_1+1    x_2      x_3+1    x_4      x_5      x_6      x_7      x_8      x_9-1    6x_9/N
    x_1      x_2+2    x_3      x_4      x_5      x_6      x_7      x_8      x_9-1    3x_9/N

Table 1   Transition under a random insertion:
a tree of class (2;x_1,x_2,...,x_9)
becomes a tree of class (2;x'_1,x'_2,...,x'_9).
Each row gives the values of x'_1,x'_2,...,x'_9
for a possible resulting class with its
probability of occurrence in the last
column.
Similar formulas for A_2(N), ..., A_9(N) can also be derived. These
relations can be compactly written in the following form: Let A(N) be
the 9-component column vector (A_i(N)); then

    A(N) = (I + \frac{1}{N} D) A(N-1)                              (4)

where I is the 9x9 identity matrix and D is given by

    D = \begin{pmatrix}
        -4 &  0 &  0 &  0 &  6 &  6 &  6 &  6 &  6 \\
         4 & -5 &  0 &  0 &  0 &  0 &  6 &  6 &  6 \\
         0 &  2 & -6 &  0 &  0 &  0 &  0 &  0 &  6 \\
         0 &  3 &  0 & -6 &  0 &  0 &  0 &  0 &  0 \\
         0 &  0 &  6 &  4 & -7 &  0 &  0 &  0 &  0 \\
         0 &  0 &  0 &  2 &  0 & -7 &  0 &  0 &  0 \\
         0 &  0 &  0 &  0 &  2 &  4 & -8 &  0 &  0 \\
         0 &  0 &  0 &  0 &  2 &  0 &  0 & -8 &  0 \\
         0 &  0 &  0 &  0 &  0 &  0 &  2 &  2 & -9
        \end{pmatrix}                                              (5)
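The matrix D can be recomputed mechanically from Table 1, which gives a useful cross-check of both. In the sketch below, each of the 17 rows of the table is encoded as a triple: the type s of the height 2 subtree that receives the key, the number c of its leaves leading to that outcome (so that the probability is c x_s / N), and the resulting changes of the counts. The encoding and the names `TABLE_1` and `build_D` are illustrative assumptions, not notation from the paper.

```python
# (source type s, leaf count c, {type: change of its count}); probability is c*x_s/N
TABLE_1 = [
    (1, 4, {1: -1, 2: +1}),
    (2, 2, {2: -1, 3: +1}),
    (2, 3, {2: -1, 4: +1}),
    (3, 6, {3: -1, 5: +1}),
    (4, 4, {4: -1, 5: +1}),
    (4, 2, {4: -1, 6: +1}),
    (5, 2, {5: -1, 7: +1}),
    (5, 2, {5: -1, 8: +1}),
    (5, 3, {5: -1, 1: +2}),
    (6, 4, {6: -1, 7: +1}),
    (6, 3, {6: -1, 1: +2}),
    (7, 2, {7: -1, 9: +1}),
    (7, 6, {7: -1, 1: +1, 2: +1}),
    (8, 2, {8: -1, 9: +1}),
    (8, 6, {8: -1, 1: +1, 2: +1}),
    (9, 6, {9: -1, 1: +1, 3: +1}),
    (9, 3, {9: -1, 2: +2}),
]

def build_D(table, size=9):
    """Entry D[t][s] is the expected change of x_t, times N, per subtree of type s."""
    D = [[0] * size for _ in range(size)]
    for s, c, changes in table:
        for t, delta in changes.items():
            D[t - 1][s - 1] += c * delta
    return D

for row in build_D(TABLE_1):
    print(row)
```

As a sanity check on the table itself, the leaf counts c of the rows sharing a source type add up to the number of leaves of that type (4, 5, 6, 6, 7, 7, 8, 8, 9), which is the coefficient vector of Lemma 2.7.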


To solve (4) for A(N), we define a 9-component column vector
a(N) = (a_i(N)) by

    a_i(N) = A_i(N)/(N+1).                                         (6)

In terms of a(N), (4) can be written as

    a(N) = (I + \frac{1}{N+1}(D - I)) a(N-1).                      (7)

To solve (7), the following two lemmas will be useful. Since recurrence
relations of the form (7) have been studied in another context [5], we
shall omit the proofs here.

Lemma 2.10: Let G be a p x p real matrix with simple eigenvalues
\lambda_0, \lambda_1, \lambda_2, ..., \lambda_{p-1}, where \lambda_0 = 0 and
Re \lambda_{p-1} \le Re \lambda_{p-2} \le ... \le Re \lambda_1 < 0.
If v(1), v(2), ..., v(N), ... is a sequence of p-component vectors satisfying
v(j) = (I + \frac{1}{j+1} G) v(j-1), then there exists a vector u such that

    (i)   Gu = 0,

    (ii)  |(v(N))_i - u_i| < C N^{Re \lambda_1} for some constant C and all i, N,

where (v(N))_i, u_i denote the ith components of v(N) and u respectively.

Corollary: v(N) -> u as N -> \infty.

Lemma 2.11: Let k be a positive integer; then the polynomial

    g(\lambda) = \prod_{j=0}^{l} (\lambda+k+j) - \prod_{j=0}^{l} (k+j)

has only simple roots. Furthermore, the real parts of all the roots
except the root \lambda = 0 are negative.

We now turn to the determination of a(N) by (7). An explicit
calculation shows that the characteristic polynomial of D-I is

    -\lambda(\lambda+7)(\lambda+8)(\lambda+9)((\lambda+9)^5 + 25(\lambda+9)^3 + 210(\lambda+9)^2 + 136(\lambda+9) + 384).

The roots of the polynomial, which are the eigenvalues of D-I, are
0, -6.55 ± 6.25i, -7, -8, -9, -9.23 ± 1.37i, -13.44. Thus, D-I satisfies
the conditions on G in Lemma 2.10. Therefore, there exists a vector
u = (u_i) such that

    |a_i(N) - u_i| < C_0 N^{-6.55}    for some constant C_0        (8)

where u satisfies

    (D-I)u = 0.                                                    (9)


In terms of the u_i's, we can express Lemma 2.9 as follows:

Lemma 2.12:

    (N+1)(\frac{7}{2}\sum_{i=1}^{3} u_i + \frac{9}{2}\sum_{i=4}^{9} u_i) - \frac{1}{2} - C N^{-5.55}
        \le n(N) \le (N+1)(4\sum_{i=1}^{3} u_i + 5\sum_{i=4}^{9} u_i) - 1 + C N^{-5.55}

for some constant C.

Proof: From equations (6) and (8), we obtain

    |A_i(N) - (N+1)u_i| < C_1 N^{-5.55}    for some constant C_1.  (10)

The lemma follows immediately from (10) and Lemma 2.9. □
Now, to find the values of the u_i's, we observe that
equation (9) determines u up to a constant factor. The normalization
constant can be determined as follows: Lemma 2.7 and equation (6)
lead to the equation

    4a_1(N) + 5a_2(N) + 6a_3(N) + 6a_4(N) + 7a_5(N) + 7a_6(N) + 8a_7(N) + 8a_8(N) + 9a_9(N) = 1.   (11)

Since a_i(N) -> u_i as N -> \infty, (11) implies that

    4u_1 + 5u_2 + 6u_3 + 6u_4 + 7u_5 + 7u_6 + 8u_7 + 8u_8 + 9u_9 = 1,   (12)

which is the equation we need in order to determine the normalization
constant. Therefore, solving equations (9) and (12), we obtain:

    u_1 = 414/7991   = 0.052
    u_2 = 396/7991   = 0.050
    u_3 = 912/55937  = 0.016
    u_4 = 1188/55937 = 0.021
    u_5 = 1278/55937 = 0.023                                       (13)
    u_6 = 297/55937  = 0.005
    u_7 = 416/55937  = 0.007
    u_8 = 284/55937  = 0.005
    u_9 = 20/7991    = 0.003
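The values in (13) can be reproduced by solving the linear system exactly. The sketch below is one possible way to do so, not the computation used in the paper: it replaces the first row of (D-I)u = 0, which is a linear combination of the other eight, by the normalization (12), and applies Gauss-Jordan elimination over the rationals. The helper `solve` and the names `LEAVES` and `M` are hypothetical.

```python
from fractions import Fraction

D = [
    [-4, 0, 0, 0, 6, 6, 6, 6, 6],
    [4, -5, 0, 0, 0, 0, 6, 6, 6],
    [0, 2, -6, 0, 0, 0, 0, 0, 6],
    [0, 3, 0, -6, 0, 0, 0, 0, 0],
    [0, 0, 6, 4, -7, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, -7, 0, 0, 0],
    [0, 0, 0, 0, 2, 4, -8, 0, 0],
    [0, 0, 0, 0, 2, 0, 0, -8, 0],
    [0, 0, 0, 0, 0, 0, 2, 2, -9],
]
LEAVES = [4, 5, 6, 6, 7, 7, 8, 8, 9]   # coefficients of the normalization (12)

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Rows 2..9 of (D - I) u = 0, followed by the normalization row.
M = [[D[i][j] - (i == j) for j in range(9)] for i in range(1, 9)] + [LEAVES]
u = solve(M, [0] * 8 + [1])
print([str(v) for v in u])
```

Dropping the first row is legitimate because the vector (4, 5, 6, 6, 7, 7, 8, 8, 9) is a left null vector of D-I with a nonzero first entry, so the first equation follows from the others.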

Substituting the values of the u_i's into the inequality in Lemma 2.12,
we obtain the main result of this section.

Theorem 2.13: 0.70N + 0.2 - C N^{-5.55} \le n(N) \le 0.79N - 0.2 + C N^{-5.55} for some
constant C.
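The theorem can be illustrated numerically by iterating recurrence (4) and applying Lemma 2.9. The sketch below works in floating point and starts from N = 3, where the only possible tree consists of a single height 2 subtree of type 1; this starting point, the list `U` of numerators of (13) over the common denominator 55937, and the function names are assumptions of the sketch.

```python
D = [
    [-4, 0, 0, 0, 6, 6, 6, 6, 6],
    [4, -5, 0, 0, 0, 0, 6, 6, 6],
    [0, 2, -6, 0, 0, 0, 0, 0, 6],
    [0, 3, 0, -6, 0, 0, 0, 0, 0],
    [0, 0, 6, 4, -7, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, -7, 0, 0, 0],
    [0, 0, 0, 0, 2, 4, -8, 0, 0],
    [0, 0, 0, 0, 2, 0, 0, -8, 0],
    [0, 0, 0, 0, 0, 0, 2, 2, -9],
]
U = [2898, 2772, 912, 1188, 1278, 297, 416, 284, 140]  # u_i of (13), times 55937

def averages(N):
    """A(N) from A(n) = (I + D/n) A(n-1), started at N = 3 (requires N >= 3)."""
    A = [1.0] + [0.0] * 8            # the 3-key tree is of type 1
    for n in range(4, N + 1):
        A = [A[i] + sum(D[i][j] * A[j] for j in range(9)) / n for i in range(9)]
    return A

def node_bounds(N):
    """Lower and upper bounds on n(N) given by Lemma 2.9."""
    A = averages(N)
    low, high = sum(A[:3]), sum(A[3:])
    return 3.5 * low + 4.5 * high - 0.5, 4 * low + 5 * high - 1

for N in (10, 100, 1000):
    lo, hi = node_bounds(N)
    print(N, round(lo / N, 4), round(hi / N, 4))
```

For large N the two printed ratios settle near the coefficients 0.70 and 0.79 of the theorem, and `averages(N)[i] / (N + 1)` approaches `U[i] / 55937`.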
The technique used in this subsection can be used to compute
higher moments of the number of nodes n(T). For example, we can set
up a system of recurrence relations of the form of equation (4) for the
quantities A_{ij}(N), where

    A_{ij}(N) = \sum_{x_1,...,x_9} P_N(2;x_1,...,x_9) x_i x_j,    i,j = 1,2,...,9.

Determination of the A_{ij}(N)'s will then lead (by Lemma 2.8) to bounds
on the average value of n(T)^2 for N-key random 3-2 trees.


2.3  Higher  Order  Analysis 

The methods used in the previous two subsections obviously
can be generalized to obtain better approximations of n(N). By computing
the average number of nodes in the lowest k levels, we can determine
n(N) to an accuracy of \frac{1}{2(2^k - 1)} \times 100%. This is so because at most
1/2^k of the keys are in nodes above the lowest k levels.

This general procedure, however, is not very useful in
practice. If F(k) is the number of different types of trees of height k,
then solution of this problem involves the manipulation of an F(k) x F(k)
matrix. It is not difficult to show that F(k) = \frac{1}{2}F(k-1)(F(k-1)+1)^2.
We have analyzed the cases k = 1, 2, where F(1) = 2 and F(2) = 9.
However, F(3) = 450 is already such a large number that carrying out
the computation appears to be very difficult.
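The growth of F(k) is easy to tabulate. The short sketch below assumes the recurrence F(k) = F(k-1)(F(k-1)+1)^2 / 2 with F(1) = 2, read from the text above; the function name `F` simply mirrors the paper's notation.

```python
def F(k):
    """Number of types of 3-2 trees of height k, starting from F(1) = 2."""
    f = 2
    for _ in range(k - 1):
        f = f * (f + 1) ** 2 // 2   # f * (f + 1) is even, so the division is exact
    return f

print([F(k) for k in (1, 2, 3, 4)])
```

Under this recurrence the matrix for k = 3 would already be 450 x 450, and the count for k = 4 runs into the tens of millions.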

3. An Analysis of B-trees

3.1 Introduction

A natural extension of 3-2 trees is the idea of "B-trees" [2]
[4]. A B-tree of order m is a tree in which the number of keys contained
in any internal node other than the root is no greater than m-1 and
no less than \lceil m/2 \rceil - 1. 3-2 trees are just B-trees of order 3. To add
a key to a node, we insert the new key among the other keys and check
if the node now contains more than m-1 keys. If the answer is no,
the insertion has been completed. Otherwise, we split the node into 2
nodes, one of which contains the smallest \lceil m/2 \rceil - 1 keys and the other
the m - \lceil m/2 \rceil largest keys; the one remaining key is then inserted into
the parent node. Random B-trees are defined in exactly the same way
as random 3-2 trees.

We shall study n_m(N), the average number of nodes in the
B-trees of order m resulting from N random insertions. An obvious
bound was given in [4]:

    \frac{1}{m-1} \le \frac{n_m(N)}{N} \le \frac{1}{\lceil m/2 \rceil - 1} + \frac{1}{N}     (14)

In this section we shall consider the nodes at the lowest level, and
do an analysis similar to the first order analysis done in Section 2.1.
As we shall see, this analysis yields better results than the corresponding
analysis for 3-2 trees. This is so because a greater proportion of keys
in a B-tree are stored in the lowest internal nodes as m becomes larger.
Define the following functions:

    H(N) = \sum_{k=1}^{N} \frac{1}{k}    for N \ge 1,

    r(m) = \frac{1}{m+1} (H(m) - H(m/2))^{-1}            if m is even,
    r(m) = \frac{1}{m+1} (H(m+1) - H((m+1)/2))^{-1}      if m is odd.
It is well known [6] that H(m) ~ \ln m + 0.58 + \frac{1}{2m} + ... . A simple
computation shows that r(m) = \frac{1}{m \ln 2} + O(\frac{1}{m^2}). Our new bounds on
n_m(N) are given below:

Theorem 3.1: For any \epsilon > 0 and fixed m,

    (1 + \frac{1}{m-1}) r(m) - \epsilon \le \frac{n_m(N)}{N} \le (1 + \frac{1}{\lceil m/2 \rceil - 1}) r(m) + \epsilon

when N is sufficiently large.

Corollary: For any \epsilon > 0 and fixed m,

    | \frac{n_m(N)}{N} - \frac{1}{m \ln 2} | < \frac{C}{m^2} + \epsilon

for all sufficiently large N, where C is a constant independent of m and N.

The corollary follows from Theorem 3.1 and the approximation of r(m)
given earlier. If all the nodes in a B-tree of order m contained m-1
keys, there would be N/(m-1) nodes. The ratio N/((m-1) n_m(N)) can
therefore be viewed as storage utilization [4]. Our corollary to
Theorem 3.1 shows that, as N becomes large, the storage utilization
is essentially \ln 2 ≈ 0.69 for fixed large m (cf. equation (14)).
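The storage utilization estimate can be tabulated for concrete orders m. The sketch below assumes r(m) = (H(m) - H(m/2))^{-1} / (m+1) for even m and r(m) = (H(m+1) - H((m+1)/2))^{-1} / (m+1) for odd m, and takes the limiting form (\epsilon -> 0) of the bounds of Theorem 3.1 as (1 + 1/(m-1)) r(m) and (1 + 1/(\lceil m/2 \rceil - 1)) r(m); the function names are hypothetical, and m \ge 3 is assumed.

```python
import math
from fractions import Fraction

def H(n):
    """Harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

def r(m):
    """The function r(m), under the definition assumed in the lead-in."""
    if m % 2 == 0:
        return 1 / ((m + 1) * (H(m) - H(m // 2)))
    return 1 / ((m + 1) * (H(m + 1) - H((m + 1) // 2)))

def utilization_range(m):
    """Limiting range of N / ((m-1) n_m(N)) implied by the two bounds (m >= 3)."""
    half = -(-m // 2)                                  # ceil(m/2)
    most_nodes = (1 + Fraction(1, half - 1)) * r(m)    # upper bound on n_m(N)/N
    fewest_nodes = (1 + Fraction(1, m - 1)) * r(m)     # lower bound on n_m(N)/N
    return 1 / ((m - 1) * most_nodes), 1 / ((m - 1) * fewest_nodes)

for m in (3, 11, 121):
    print(m, [round(float(v), 3) for v in utilization_range(m)])
print(round(math.log(2), 3))
```

For m = 3 the function gives r(3) = 3/7, which agrees with the value (A_1(N) + A_2(N))/(N+1) = 3/7 found for 3-2 trees in Section 2.1.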

3.2  Proof  of  Theorem  3.1 

We will first introduce some notation. Note that there are
m - \lceil m/2 \rceil + 1 types of B-trees of order m and height 1. As shown in
Figure 5, a type i B-tree contains i keys in its node, for
i = \lceil m/2 \rceil - 1, \lceil m/2 \rceil, ..., m-1. A B-tree of order m is said to be of class
(y_{\lceil m/2 \rceil - 1}, y_{\lceil m/2 \rceil}, ..., y_{m-1}) if at the lowest level there are y_i subtrees
of type i for each i. Let P_N(y_{\lceil m/2 \rceil - 1}, y_{\lceil m/2 \rceil}, ..., y_{m-1}) be the
probability for a random N-key B-tree of order m to be of class
(y_{\lceil m/2 \rceil - 1}, ..., y_{m-1}).

Figure 5   The height 1 B-trees of order m
consist of m - \lceil m/2 \rceil + 1 types
(Type \lceil m/2 \rceil - 1, Type \lceil m/2 \rceil, ..., Type m-1).
(Shown for m = 10)

Definition 3.2:

    A_i(N) = \sum_{y_{\lceil m/2 \rceil - 1},...,y_{m-1}} y_i P_N(y_{\lceil m/2 \rceil - 1}, ..., y_{m-1}),    i = \lceil m/2 \rceil - 1, ..., m-1.

For brevity, we have suppressed the dependence of A_i(N) and P_N on m
in our notation.

Lemma 3.3:

    (1 + \frac{1}{m-1}) \sum_{i=\lceil m/2 \rceil - 1}^{m-1} A_i(N) - \frac{1}{m-1} \le n_m(N)

        \le (1 + \frac{1}{\lceil m/2 \rceil - 1}) \sum_{i=\lceil m/2 \rceil - 1}^{m-1} A_i(N) - \frac{1}{\lceil m/2 \rceil - 1} + 1.

Proof: Similar to the proof of Lemma 2.1. The term +1 appearing on
the right-hand side of the inequality arises from the fact that the root
may contain fewer than \lceil m/2 \rceil - 1 keys. □

The major effort in proving Theorem 3.1 is contained in the next lemma.

Lemma 3.4: Define g(N) = \sum_{i=\lceil m/2 \rceil - 1}^{m-1} A_i(N)/(N+1). Then for any \epsilon > 0,
|g(N) - r(m)| < \epsilon for all sufficiently large N.

Proof: We shall assume m = 2p to be an even number. The proof for
odd m is similar.

Let T be an (N-1)-key B-tree of order 2p. After a random
insertion, T may become a B-tree of class (y_{p-1}, ..., y_{i-1}-1, y_i+1, ..., y_{2p-1})
with probability i y_{i-1}/N (for each i = p, ..., 2p-1), or it may become a
B-tree of class (y_{p-1}+1, y_p+1, y_{p+1}, ..., y_{2p-1}-1) with probability 2p y_{2p-1}/N.
It follows that

    A_{p-1}(N) = A_{p-1}(N-1) + \frac{1}{N}(-p A_{p-1}(N-1) + 2p A_{2p-1}(N-1)),

    A_p(N) = A_p(N-1) + \frac{1}{N}(p A_{p-1}(N-1) - (p+1) A_p(N-1) + 2p A_{2p-1}(N-1)),

and

    A_j(N) = A_j(N-1) + \frac{1}{N}(j A_{j-1}(N-1) - (j+1) A_j(N-1))
                 for p+1 \le j \le 2p-1.                           (15)


Denoting by A(N) the (p+1)-component vector (A_j(N)), (15) can be
written in matrix notation as

    A(N) = (I + \frac{1}{N} B) A(N-1)                              (16)

where I is the (p+1) x (p+1) identity matrix, and B is defined by

    B = \begin{pmatrix}
        -p &        &        &        &        & 2p  \\
         p & -(p+1) &        &        &        & 2p  \\
           &  p+1   & -(p+2) &        &        & 0   \\
           &        &  p+2   & -(p+3) &        & 0   \\
           &        &        & \ddots & \ddots & \vdots \\
           &        &        &        &  2p-1  & -2p
        \end{pmatrix}                                              (17)

To solve (16) for A(N), we define a (p+1)-component vector a(N) = (a_i(N))
by

    a_i(N) = A_i(N)/(N+1).                                         (18)

Equations (16) and (18) lead to the following recurrence relation:

    a(N) = [I + \frac{1}{N+1}(B - I)] a(N-1).                      (19)


The characteristic polynomial q(\lambda) of B-I is computed to be

    q(\lambda) = (-1)^{p+1} (\lambda+2p+1) [ \prod_{j=1}^{p} (\lambda+p+j) - \prod_{j=1}^{p} (p+j) ].     (20)

From Lemma 2.11, it is easy to see that the roots of q(\lambda) = 0 satisfy
the following conditions:

    (i)   All roots are simple roots,
    (ii)  \lambda = 0 is a root,
    (iii) The real parts of all roots except \lambda = 0 are negative.

Therefore, according to Lemma 2.10, there exists a vector
u = (u_{p-1}, u_p, ..., u_{2p-1})^T such that

    (i)   (B-I)u = 0                                               (21)

    (ii)  |a_i(N) - u_i| < C_m N^{-\epsilon_m}    for p-1 \le i \le 2p-1   (22)

where C_m, \epsilon_m are positive constants.

(22) implies

    | \sum_{i=p-1}^{2p-1} A_i(N)/(N+1) - \sum_{i=p-1}^{2p-1} u_i | \le (p+1) C_m N^{-\epsilon_m}.     (23)

Now, to determine the u_i's, we note that the following
equation can be proved easily (cf. the derivation of equation (12)):

    \sum_{i=p-1}^{2p-1} (i+1) u_i = 1.                              (24)

Solving (21) and (24), we obtain

    u_{p-1} = \frac{1}{(p+1)(2p+1)} [H(2p) - H(p)]^{-1},

    u_i = \frac{1}{(i+1)(i+2)} [H(2p) - H(p)]^{-1},    p \le i \le 2p-1.     (25)

Therefore,

    \sum_{i=p-1}^{2p-1} u_i = \frac{1}{2p+1} (H(2p) - H(p))^{-1} = r(m).     (26)

Finally, substituting (26) in (23), the lemma is obtained. □
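The solution of (21) and (24) can be verified exactly for small p. The sketch below assumes the closed forms u_{p-1} = [(p+1)(2p+1)(H(2p) - H(p))]^{-1} and u_i = [(i+1)(i+2)(H(2p) - H(p))]^{-1} for p \le i \le 2p-1, applies the rows of B as read off from the recurrences (15), and compares the sum of the u_i with (2p+1)^{-1}(H(2p) - H(p))^{-1}. The function names are hypothetical.

```python
from fractions import Fraction

def H(n):
    """Harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

def check(p):
    """Check (21), (24), and the sum in (26) for B-trees of order m = 2p."""
    idx = list(range(p - 1, 2 * p))               # types p-1, ..., 2p-1
    h = H(2 * p) - H(p)
    u = {i: 1 / ((i + 1) * (i + 2) * h) for i in idx}
    u[p - 1] = 1 / ((p + 1) * (2 * p + 1) * h)
    # B u, row by row, following the recurrences (15)
    Bu = {i: i * u[i - 1] - (i + 1) * u[i] for i in idx[1:]}
    Bu[p - 1] = -p * u[p - 1] + 2 * p * u[2 * p - 1]
    Bu[p] += 2 * p * u[2 * p - 1]                 # split of a full node feeds type p
    stationary = all(Bu[i] == u[i] for i in idx)  # (B - I) u = 0
    normalized = sum((i + 1) * u[i] for i in idx) == 1
    total_ok = sum(u.values()) == 1 / ((2 * p + 1) * h)
    return stationary, normalized, total_ok

print([check(p) for p in range(1, 6)])
```

Each triple printed should read `(True, True, True)`; exact fractions are used so that the comparison is not affected by rounding.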

Proof of Theorem 3.1: It is a direct consequence of Lemmas 3.3 and
3.4. □


4. Conclusion

We have derived bounds on the average number of nodes in an
N-key random B-tree, which essentially is the average number of
"splittings" in building the tree. One interesting result is that the
asymptotic storage utilization is approximately \ln 2 ≈ 69% for B-trees
of high orders. This seems to agree well with one set of experimental
data (m = 121, N = 5000, storage utilization = 67%, see [4]).

Many problems about 3-2 trees remain to be investigated.
What is the average number of splittings on the N-th random insertion
[3]? How can 3-2 trees be analyzed when deletions are also present? Some
upper bounds can be obtained for the former problem using the present
approach, but it appears that very different methods would be required
to answer these questions satisfactorily.